Table of Contents
Introduction
1.1. Library
1.2. Load data sets
1.3. Tools for Exploratory Data Analysis
Claim Evolution over the years
2.1. Claims per seasons
2.2. Temperature and snow/rain effects on the number of claims
2.3. Variation of the repair delay over the years
2.4. Temperature and snow/rain effects on the repair time
Number of claims vs. time for repair
Intersections
Photos
Population
Claim distribution per neighborhood
Time for repair per neighborhood
Specific case study
Conclusion
This notebook is part of the data storytelling module. Its purpose is to explore the data extracted from:
The corresponding data files are stored in the following folders:
The methodology used to clean the files is described in the Data Wrangling Report (./Data Wrangling Report.pdf)
Questions
Through this notebook, the following questions will be investigated:
import pandas as pd
import numpy as np
from datetime import datetime as dt
import datetime
import math
import calendar
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import matplotlib.dates as mdates
import matplotlib.cbook as cbook
%matplotlib inline
import seaborn as sns
import scipy
import scipy.stats as stats
from scipy.stats import norm
import folium
potholes_df = pd.read_csv('./Cleaned Data/Closed_Pothole_Cases_Cleaned.csv',
parse_dates=['CLOSED_DT','OPEN_DT','TARGET_DT'],
index_col=0)
weather_df = pd.read_csv('./Cleaned Data/Weather_Data_Cleaned.csv',index_col='DATE',parse_dates=True)
boston_zip_df = pd.read_csv('./Cleaned Data/Boston_Pop_Cleaned.csv',index_col='zipcode')
Confirm that the import was successful:
# Display df shapes
print(potholes_df.shape)
print(weather_df.shape)
print(boston_zip_df.shape)
# Display df structures
print(potholes_df.info())
print('----------------------')
print(weather_df.info())
print('----------------------')
print(boston_zip_df.info())
Below is a set of functions used throughout the notebook for hypothesis testing and statistical exploration:
def ecdf(data):
"""Compute ECDF for a one-dimensional array of measurements."""
# Number of data points: n
n=len(data)
# x-data for the ECDF: x
x = np.sort(data)
# y-data for the ECDF: y
y = np.arange(1, n+1) / n
return x, y
def bootstrap_replicate_1d(data, func):
return func(np.random.choice(data, size=len(data)))
def draw_bs_reps(data, func, size=1):
"""Draw bootstrap replicates."""
# Initialize array of replicates: bs_replicates
bs_replicates = np.empty(size)
# Generate replicates
for i in range(size):
bs_replicates[i] = bootstrap_replicate_1d(data,func)
return bs_replicates
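Before moving on, here is a quick sanity check of these helpers on synthetic data (the sample below is illustrative, not drawn from the pothole data set; the functions are redefined so the cell stands alone):

```python
import numpy as np

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    return np.sort(data), np.arange(1, n + 1) / n

def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates of func applied to resamples of data."""
    return np.array([func(np.random.choice(data, size=len(data)))
                     for _ in range(size)])

# Synthetic sample: 1000 draws from N(10, 2) -- illustrative only
np.random.seed(42)
sample = np.random.normal(loc=10, scale=2, size=1000)

# ECDF runs from 1/n to 1 over the sorted sample values
x, y = ecdf(sample)

# 1000 bootstrap replicates of the sample mean and a 95% CI
bs_means = draw_bs_reps(sample, np.mean, size=1000)
ci = np.percentile(bs_means, [2.5, 97.5])
print("ECDF endpoints:", y[0], y[-1])
print("95% bootstrap CI on the mean:", ci)
```

The confidence interval should bracket the true mean of 10, which is the pattern used later in the notebook for seasonal claim counts.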
In order to get a sense of the efficiency of the Department of Transportation, we are going to investigate the evolution of the number of claims over the years.
# Prepare dataframe
yearly_claim_df = potholes_df[['OPEN_DT','CASE_ENQUIRY_ID']].copy()
# Bucket OPEN_DT into bi-monthly bins: days 1-15 map to the 1st, days 16-31 to the 16th
yearly_claim_df.OPEN_DT = yearly_claim_df.OPEN_DT.apply(lambda x: x.replace(day=(x.day//16*15+1))).dt.date
# Add season for visual inspection
season_dict = {
1: 'Winter',
2: 'Spring',
3: 'Spring',
4: 'Spring',
5: 'Summer',
6: 'Summer',
7: 'Summer',
8: 'Fall',
9: 'Fall',
10: 'Fall',
11: 'Winter',
12: 'Winter',
}
yearly_claim_df = yearly_claim_df.groupby('OPEN_DT').count()
yearly_claim_df['Season'] = yearly_claim_df.index.map(lambda x: season_dict[x.month])
yearly_claim_df.head()
yearly_claim_season_df = yearly_claim_df.groupby('Season').sum()
yearly_claim_season_df
# Extract the seasonal data
spring_claims = yearly_claim_df[yearly_claim_df.Season=="Spring"].CASE_ENQUIRY_ID
summer_claims = yearly_claim_df[yearly_claim_df.Season=="Summer"].CASE_ENQUIRY_ID
fall_claims = yearly_claim_df[yearly_claim_df.Season=="Fall"].CASE_ENQUIRY_ID
winter_claims = yearly_claim_df[yearly_claim_df.Season=="Winter"].CASE_ENQUIRY_ID
# Print results
print("Spring:")
print(spring_claims.describe())
print("***********************")
print("Summer:")
print(summer_claims.describe())
print("***********************")
print("Fall:")
print(fall_claims.describe())
print("***********************")
print("Winter:")
print(winter_claims.describe())
sns.set_context("notebook", font_scale=1.3, rc={"lines.linewidth": 2.2})
sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize=(16,6))
sns.boxplot(x='Season',y="CASE_ENQUIRY_ID",
data=yearly_claim_df,
palette = sns.color_palette(['red','darkorange','c','limegreen']))
ax.set(xlabel="Season",ylabel='Monthly claim count')
plt.title('Number of claims per season', fontsize=14);
As shown above, the number of monthly claims peaks in spring, while the other three seasons show relatively similar counts. Intuitively, one might expect claims to peak in winter. Our assumption is therefore that the number of claims is correlated with winter weather, but with a time lag rather than directly. In the next sections of this report, we will try to quantify that time shift.
# Create x labels using list comprehension
x_label = [str(x.year)+"-"+str(x.month) if x.day==1 and x.month==1 else
           str(x.month) if x.day==1 and x.month%2==1 else '' for x in yearly_claim_df.index]
# Plot
fig, ax = plt.subplots(figsize=(16,10))
ax = sns.barplot(x=yearly_claim_df.index,
y=yearly_claim_df.CASE_ENQUIRY_ID,hue=yearly_claim_df.Season,
palette = sns.color_palette(['red','darkorange','c','limegreen']))
# Set plot labels
ax.set_title("Number of claims (bi-monthly)")
ax.set(xlabel="",ylabel='Claim count')
ax.set_xticklabels(x_label,rotation=45)
plt.savefig('./Figures/Claims_Per_Season.jpg',bbox_inches='tight')
plt.show()
As shown above, the maximum number of monthly claims typically occurs in March (first month of spring). We can also observe that the tougher the winter the larger the number of claims. Indeed, the winters of 2014, 2015, and 2017 were worse than the ones of the other years included in the data set. This observation reinforces our assumption that the number of monthly claims is correlated with the intensity of the winter.
# Group the number of reported cases by month
yearly_claim_df.index = pd.to_datetime(yearly_claim_df.index)
to_plot = yearly_claim_df.groupby(by=yearly_claim_df.index.month).sum()
to_plot['Month_Name'] = to_plot.index
to_plot.Month_Name = to_plot.Month_Name.apply(lambda x: calendar.month_abbr[x])
to_plot.set_index(to_plot.Month_Name,inplace=True,drop=True)
to_plot.plot(figsize=(16,10),kind='bar',rot=45,color='blueviolet',alpha=0.5);
plt.xlabel('Month')
plt.ylabel('Number of claims')
plt.title('Number of claims per month [2011 to 2017]')
plt.legend().set_visible(False)
plt.show();
As expected, more potholes appear after the winter. Based on the above plots, the number of claims is maximum over the January to April period. We can draw two conclusions:
In order to validate the second claim, we will use the plot above and we will superimpose the snowfall and temperature variation.
# Extract seasonal records
spring_claims = yearly_claim_df[yearly_claim_df.Season=="Spring"].CASE_ENQUIRY_ID.values
summer_claims = yearly_claim_df[yearly_claim_df.Season=="Summer"].CASE_ENQUIRY_ID.values
fall_claims = yearly_claim_df[yearly_claim_df.Season=="Fall"].CASE_ENQUIRY_ID.values
winter_claims = yearly_claim_df[yearly_claim_df.Season=="Winter"].CASE_ENQUIRY_ID.values
# Normalize the data so it can be compared
spring_claims_norm = (spring_claims-np.mean(spring_claims))/np.std(spring_claims)
summer_claims_norm = (summer_claims-np.mean(summer_claims))/np.std(summer_claims)
fall_claims_norm = (fall_claims-np.mean(fall_claims))/np.std(fall_claims)
winter_claims_norm = (winter_claims-np.mean(winter_claims))/np.std(winter_claims)
# Compute ECDF
x_spring, y_spring = ecdf(spring_claims_norm)
x_summer, y_summer = ecdf(summer_claims_norm)
x_fall, y_fall = ecdf(fall_claims_norm)
x_winter, y_winter = ecdf(winter_claims_norm)
# Normal ECDF
x_norm = np.linspace(-2.0,3.0,num=50)
y_norm = norm.cdf(x_norm)
plt.figure(figsize=(16,10))
plt.plot(x_spring,y_spring,marker='.',linestyle='-',color='limegreen',label='Spring')
plt.plot(x_summer,y_summer,marker='.',linestyle='-',color='red',label='Summer')
plt.plot(x_fall,y_fall,marker='.',linestyle='-',color='darkorange',label='Fall')
plt.plot(x_winter,y_winter,marker='.',linestyle='-',color='c',label='Winter')
plt.plot(x_norm,y_norm,marker='.',linestyle='-',color='k',label='Normal Dist.')
plt.margins(0.02)
plt.legend()
plt.xlabel('Monthly claim count')
plt.ylabel('ECDF')
plt.title('Normalized ECDF - Monthly claim count per season')
plt.savefig('./Figures/ECDF_Count.jpg',bbox_inches='tight')
plt.show()
Analysis: Based on the plot above, it seems that only the spring claim counts closely follow a normal distribution. In order to validate this observation, we will perform a normality test using the scipy library:
Null hypothesis: Ho -> The claim count is normally distributed.
Alternative hypothesis Ha -> The claim count is not normally distributed.
We will use a 0.05 cut-off value to make our conclusion.
# Print results
print("Spring:")
print(stats.normaltest(spring_claims))
print("********************************************************************************************")
print("Summer:")
print(stats.normaltest(summer_claims))
print("********************************************************************************************")
print("Fall:")
print(stats.normaltest(fall_claims))
print("********************************************************************************************")
print("Winter:")
print(stats.normaltest(winter_claims))
Based on the p-values listed above, the spring and winter distributions can be assumed to be normally distributed.
As we saw previously, there seems to be a small lag between the number of potholes reported during a month and how cold the previous months were. In order to estimate this lag, we will first plot some weather data to identify which months have the largest number of freezing days and the largest snowfalls.
# Create a custom data frame to plot the data
to_plot = weather_df.loc['2011':,['DT32','DSNW']]
# Group the data by month and average
to_plot = to_plot.groupby(by=to_plot.index.month).mean()
# Extract month number for plotting purpose
to_plot['Month_Name'] = to_plot.index
to_plot.Month_Name = to_plot.Month_Name.apply(lambda x: calendar.month_abbr[x])
to_plot.set_index(to_plot.Month_Name,inplace=True,drop=True)
to_plot[['DT32','DSNW']].plot(figsize=(16,10),kind='bar',rot=45,color=['lightskyblue','silver'],alpha=0.7)
plt.xlabel('Month')
plt.ylabel('Average number of days')
plt.title('Average monthly number of freezing and snow days')
plt.legend(['Number of freezing days','Number of snow days'])
plt.savefig('./Figures/Snow_Freeze.jpg',bbox_inches='tight')
plt.show();
As shown above, January, February, and December have the largest numbers of both freezing and snow days. Since the pothole request count peaks in March, we should expect the maximum correlation between the weather data (DSNW and DT32) and the number of claims when the weather data is shifted forward by one or two months.
# Prepare dataframe
yearly_claim_offset_df = potholes_df[['OPEN_DT','CASE_ENQUIRY_ID']].copy()
yearly_claim_offset_df.OPEN_DT = yearly_claim_offset_df.OPEN_DT.apply(lambda x: x.replace(day=1)).dt.date
# Add season for visual inspection
season_dict = {
1: 'Winter',
2: 'Spring',
3: 'Spring',
4: 'Spring',
5: 'Summer',
6: 'Summer',
7: 'Summer',
8: 'Fall',
9: 'Fall',
10: 'Fall',
11: 'Winter',
12: 'Winter',
}
# We create the season feature
yearly_claim_offset_df = yearly_claim_offset_df.groupby('OPEN_DT').count()
yearly_claim_offset_df['Season'] = yearly_claim_offset_df.index.map(lambda x: season_dict[x.month])
yearly_claim_offset_df.head()
# Weather data
#weather_offset_df = weather_df.loc[:,['DSNW','DT32','TAVG',"DP10"]]
weather_offset_df = weather_df.copy()
# We create new features that shift the number of claims one month at a time into the past
for shift in range(-1,-6,-1): # 1 to 5 months
yearly_claim_offset_df["CASE_ENQUIRY_ID_Shift_"+str(shift)] = yearly_claim_offset_df.CASE_ENQUIRY_ID.shift(shift)
# Merge pothole and weather data
yearly_claim_offset_df = yearly_claim_offset_df.merge(right=weather_offset_df,how='left',left_index=True,right_index=True)
# Compute correlation matrix and trim the unnecessary values
correlation = yearly_claim_offset_df.corr()
correlation=correlation.loc[correlation.columns.str.contains('CASE'),~correlation.index.str.contains('CASE')]
correlation
The correlation matrix shown above contains some high values, but the raw matrix is not the most convenient representation. We now use a Seaborn heatmap to visualize it. Note that we only display correlation coefficients greater than 0.6 in absolute value.
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 6))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
ax = plt.axes()
sns.heatmap(correlation, cmap=cmap,square=True, linewidths=.5,annot=True)
ax.set_title("Period offset study - Monthly offset")
for text in ax.texts:
if math.fabs(float(text.get_text()))>0.6:
text.set_size(15)
else:
text.set_size(0)
plt.savefig('./Figures/Heat_Weather_Claims.jpg',bbox_inches='tight')
From the study of the heatmap, we can draw the following conclusions:
However, these correlations need to be interpreted with caution: the precipitation data (snow or rain) are themselves highly correlated with the temperatures. The plot below shows the correlations among the weather variables.
# Create mask
mask = np.zeros_like(weather_df.corr())
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 20))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
ax = plt.axes()
sns.heatmap(weather_df.corr(), cmap=cmap,square=True, linewidths=.5,annot=True,mask=mask)
ax.set_title("Weather data correlation")
for text in ax.texts:
if math.fabs(float(text.get_text()))>0.6:
text.set_size(15)
else:
text.set_size(0)
Now that we have a better understanding of the overall data correlation, we can investigate specific relationships using the (-1) month offset.
yearly_claim_offset_df.dropna(axis=0,subset=['CASE_ENQUIRY_ID_Shift_-1'],inplace=True)
fig, ax = plt.subplots(figsize=(16,6))
sns.regplot(x='DSNW',y="CASE_ENQUIRY_ID_Shift_-1",data=yearly_claim_offset_df,color="royalblue")
ax.set_title("Effect of the number of snow days (offset=1 month)")
ax.set(xlabel="Number of snow day per month",ylabel='Claim count')
plt.savefig('./Figures/1_Reg_DSNW_Count.jpg',bbox_inches='tight');
slope,intercept = np.polyfit(yearly_claim_offset_df.DSNW.values,
yearly_claim_offset_df["CASE_ENQUIRY_ID_Shift_-1"].values,
deg=1)
print("Slope:",slope)
print("Intercept",intercept)
Note
As expected, there is a clear positive correlation (0.79) between the frequency of snowfall over a month and the number of claims created the next month. It is interesting to notice the group of data points corresponding to months with zero snow days. We will have to use other variables to investigate these cases specifically.
fig, ax = plt.subplots(figsize=(16,6))
sns.regplot(x='DT32',y="CASE_ENQUIRY_ID_Shift_-1",data=yearly_claim_offset_df,color="royalblue")
ax.set_title("Effect of the number of freezing days (offset=1 month)")
ax.set(xlabel="Number of freezing day per month",ylabel='Claim count')
plt.savefig('./Figures/2_Reg_DT32_Count.jpg',bbox_inches='tight');
slope,intercept = np.polyfit(yearly_claim_offset_df.DT32.values,
yearly_claim_offset_df["CASE_ENQUIRY_ID_Shift_-1"].values,
deg=1)
print("Slope:",slope)
print("Intercept",intercept)
Note
As expected, there is a clear positive correlation (0.74) between the frequency of freezing days over a month and the number of claims created the next month. However, we have to be careful because the number of snow days is also positively correlated (0.79) with the number of freezing days (See plot below).
fig, ax = plt.subplots(figsize=(16,6))
sns.regplot(x='DT32',y="DSNW",data=weather_df,color="royalblue")
ax.set_title("Correlation between snowfall and number of freezing days")
ax.set(xlabel="Number of freezing day per month",ylabel='Number of snow day per month')
plt.savefig('./Figures/3_Reg_DT32_DSNW.jpg',bbox_inches='tight');
slope,intercept = np.polyfit(yearly_claim_offset_df.DT32.values,
yearly_claim_offset_df.DSNW.values,
deg=1)
print("Slope:",slope)
print("Intercept",intercept)
fig, ax = plt.subplots(figsize=(16,6))
sns.regplot(x='TAVG',y="CASE_ENQUIRY_ID_Shift_-1",data=yearly_claim_offset_df,color="orangered")
ax.set_title("Effect of the daily average temperature (offset=1 month)")
ax.set(xlabel="Monthly average temperature",ylabel='Claim count')
plt.savefig('./Figures/4_Reg_TAVG_COUNT.jpg',bbox_inches='tight');
slope,intercept = np.polyfit(yearly_claim_offset_df.TAVG.values,
yearly_claim_offset_df["CASE_ENQUIRY_ID_Shift_-1"].values,
deg=1)
print("Slope:",slope)
print("Intercept",intercept)
Note
As expected, there is a negative correlation (-0.71) between the average monthly temperature and the number of claims created the next month. The data points lie closer to the regression line at higher average temperatures, which can be explained by the effect of snow at lower temperatures. We can infer that, of two months with the same number of freezing days and the same average temperature, the month with the higher snowfall will produce more claims. The residual plot below confirms this data point distribution.
fig, ax = plt.subplots(figsize=(16,6))
sns.residplot(x='TAVG',y="CASE_ENQUIRY_ID_Shift_-1",data=yearly_claim_offset_df,color="orangered")
ax.set_title("Residual - Effect of the daily average temperature (offset=1 month)")
ax.set(xlabel="Monthly average temperature",ylabel='Monthly claim count')
plt.savefig('./Figures/5_Res_TAVG_COUNT.jpg',bbox_inches='tight');
The plot presented above depicts the residuals of the regression between the monthly average temperature and the monthly number of claims. It is interesting to notice that the residuals shrink (in absolute value) as the monthly average temperature increases. Our assumption is that the higher variance at lower temperatures (near freezing) is probably explained by the amount of snow; indeed, we showed earlier that the number of claims is positively correlated with the number of snow days per month.
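One way to probe this reading quantitatively (a sketch, not part of the notebook's pipeline) is to fit a single least-squares model that includes all three weather variables at once and inspect the snow coefficient. The data below is synthetic, generated with a known positive snow effect, so the columns named `dt32`, `dsnw`, and `tavg` are stand-ins for the merged dataframe:

```python
import numpy as np

# Synthetic monthly weather features (illustrative, not the real records)
rng = np.random.default_rng(0)
n = 84  # seven years of months
dt32 = rng.integers(0, 28, size=n).astype(float)            # freezing days
dsnw = np.clip(dt32 * 0.5 + rng.normal(0, 2, n), 0, None)   # snow days, correlated with dt32
tavg = 60 - 1.2 * dt32 + rng.normal(0, 3, n)                # average temperature

# Claims generated with a known positive snow effect (+15 per snow day)
claims = 200 + 5 * dt32 + 15 * dsnw - 1.0 * tavg + rng.normal(0, 30, n)

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), dt32, dsnw, tavg])
coef, *_ = np.linalg.lstsq(X, claims, rcond=None)
print("intercept, DT32, DSNW, TAVG coefficients:", coef)
```

Even with the collinearity among the three predictors, the fit recovers a clearly positive snow-day coefficient, which is the kind of evidence that would separate the snow effect from the temperature effect on the real data.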
fig, ax = plt.subplots(figsize=(16,6))
sns.regplot(x='DP10',y="CASE_ENQUIRY_ID_Shift_-1",data=yearly_claim_offset_df,color="navy")
ax.set_title("Effect of the precipitation (offset=1 month)")
ax.set(xlabel="Number of days with more than one inch of rain",ylabel='Claim count')
plt.savefig('./Figures/6_Reg_DP10_COUNT.jpg',bbox_inches='tight');
slope,intercept = np.polyfit(yearly_claim_offset_df.DP10.values,
yearly_claim_offset_df["CASE_ENQUIRY_ID_Shift_-1"].values,
deg=1)
print("Slope:",slope)
print("Intercept",intercept)
Note
The correlation (0.26) between precipitation and claims is weak. In conclusion, only the snow-related variables appear strongly correlated with the number of claims.
# Scatter plot of the monthly number of claims as a function of the snowfall and freezing days
fig, ax = plt.subplots(figsize=(16,6))
plt.scatter(x=yearly_claim_offset_df['DT32'],
y=yearly_claim_offset_df['DSNW'],
s=yearly_claim_offset_df['CASE_ENQUIRY_ID'],
c = 'dodgerblue',alpha = 0.5)
ax.set_title("Monthly number of claims (offset=1 month)")
ax.set(xlabel="Number of freezing days per month",ylabel='Number of snow days per month');
plt.legend(handles=[mpatches.Patch(color='dodgerblue', alpha=0.5,label='Monthly claims')]);
As shown above, the relationship between the number of claims and the numbers of freezing and snow days is not exactly linear: a steeper, roughly bilinear trend develops beyond about 15 freezing days.
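That knee around 15 freezing days could be quantified with a broken-stick (hinge) regression. The sketch below runs on synthetic data with an assumed knee at 15 days; applying the same design matrix to `yearly_claim_offset_df` would be the natural next step:

```python
import numpy as np

# Synthetic (freezing days, claim count) pairs with a knee at 15 days (illustrative)
rng = np.random.default_rng(1)
dt32 = rng.uniform(0, 28, size=120)
claims = 100 + 4 * dt32 + 40 * np.maximum(dt32 - 15, 0) + rng.normal(0, 25, 120)

# Hinge regression: claims ~ a + b*dt32 + c*max(dt32 - 15, 0)
# c measures the extra slope that kicks in beyond 15 freezing days
X = np.column_stack([np.ones_like(dt32), dt32, np.maximum(dt32 - 15, 0)])
a, b, c = np.linalg.lstsq(X, claims, rcond=None)[0]
print(f"slope below 15 days: {b:.1f}, extra slope above 15 days: {c:.1f}")
```

A large, positive `c` relative to `b` is what a genuinely bilinear relationship looks like; on the real data the knee location itself could also be chosen by minimizing the residual sum of squares over candidate values.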
Now that we know cold weather has a significant impact on road damage frequency, we will look at the impact of the weather on the repair time. The repair time is defined as the difference between the claim closure date and the claim creation date. As previously stated, the pothole database can be updated either through 311 requests or by city workers. A significant number of potholes are discovered and fixed on the spot by city workers patrolling the streets. In order to focus only on claims made by "normal" users, we will restrict the data set to potholes that took more than half a day to fix.
# Prepare dataframe
yearly_repair_time_df = potholes_df[['OPEN_DT','time_repair']].copy()
yearly_repair_time_df = yearly_repair_time_df[yearly_repair_time_df.time_repair>=0.5]
# Re-adjust the OPEN_DT to either the first day of the month or the 15th (for plotting purpose)
yearly_repair_time_df.OPEN_DT = yearly_repair_time_df.OPEN_DT.apply(lambda x: x.replace(day=(x.day//16*15+1))).dt.date
# Add season for visual inspection
season_dict = {
1: 'Winter',
2: 'Spring',
3: 'Spring',
4: 'Spring',
5: 'Summer',
6: 'Summer',
7: 'Summer',
8: 'Fall',
9: 'Fall',
10: 'Fall',
11: 'Winter',
12: 'Winter',
}
yearly_repair_time_df = yearly_repair_time_df.groupby('OPEN_DT').median()
yearly_repair_time_df['Season'] = yearly_repair_time_df.index.map(lambda x: season_dict[x.month])
yearly_repair_time_df.head()
yearly_repair_time_df.groupby('Season').median()
sns.set_context("notebook", font_scale=1.3, rc={"lines.linewidth": 2.2})
fig, ax = plt.subplots(figsize=(18,15))
sns.boxplot(x='Season',y="time_repair",
data=yearly_repair_time_df,
palette = sns.color_palette(['red','darkorange','c','limegreen']))
ax.set(xlabel="Season",ylabel='Delay to repair')
plt.title('Median number of days to repair a pothole', fontsize=14);
With only a few outliers and a median time of 2 days to fix a pothole, the city has a good response time. Moreover, the response time barely varies from one season to the other, which indicates that the city responds well to pothole claims even during the winter.
# Set main plot parameters
sns.set_context("notebook", font_scale=1.3, rc={"lines.linewidth": 2.2})
# Create x labels using list comprehension
yearly_claim_df.index = pd.to_datetime(yearly_claim_df.index)
x_label = [str(x.year)+"-"+str(x.month) if x.day==1 and x.month==1 else
           str(x.month) if x.day==1 and x.month%2==1 else '' for x in yearly_claim_df.index]
# Plot
fig, ax = plt.subplots(figsize=(16,8))
ax = sns.barplot(x=yearly_repair_time_df.index,
                 y=yearly_repair_time_df.time_repair,hue=yearly_repair_time_df.Season,
                 palette = sns.color_palette(['red','darkorange','c','limegreen']))
# Set plot labels
ax.set_title("Median number of days to repair a pothole (bi-monthly)")
ax.set(xlabel="",ylabel='Repair time')
ax.set_xticklabels(x_label,rotation=45)
plt.savefig('./Figures/90_Plt_Repair_Season.jpg',bbox_inches='tight');
plt.show()
The first month of spring 2015 is clearly an outlier. Let's investigate why the repair time was so long. Before looking at the data for this specific month, we can form a hypothesis: the increase in repair time is probably due to the snowfalls. Indeed, New England experienced its worst winter in decades that year; the accumulated snow and long-lasting freezing temperatures prevented the city from fixing the potholes.
yearly_repair_time_df.head()
# Extract seasonal records
spring_claims = yearly_repair_time_df[yearly_repair_time_df.Season=="Spring"].time_repair.values
summer_claims = yearly_repair_time_df[yearly_repair_time_df.Season=="Summer"].time_repair.values
fall_claims = yearly_repair_time_df[yearly_repair_time_df.Season=="Fall"].time_repair.values
winter_claims = yearly_repair_time_df[yearly_repair_time_df.Season=="Winter"].time_repair.values
# Compute ECDF
x_spring, y_spring = ecdf(spring_claims)
x_summer, y_summer = ecdf(summer_claims)
x_fall, y_fall = ecdf(fall_claims)
x_winter, y_winter = ecdf(winter_claims)
plt.figure(figsize=(16,10))
plt.plot(x_spring,y_spring,marker='.',linestyle='-',color='limegreen',label='Spring')
plt.plot(x_summer,y_summer,marker='.',linestyle='-',color='red',label='Summer')
plt.plot(x_fall,y_fall,marker='.',linestyle='-',color='darkorange',label='Fall')
plt.plot(x_winter,y_winter,marker='.',linestyle='-',color='c',label='Winter')
plt.margins(0.02)
plt.legend()
plt.xlabel('Median repair time (days)')
plt.ylabel('ECDF')
plt.title('ECDF - Monthly median repair time per season')
plt.savefig('./Figures/90_ECDF_Season_Time.jpg',bbox_inches='tight');
plt.show()
Analysis: The ECDFs of the four seasons are relatively similar, with the exception of a few outliers in the spring data.
claims_feb_2015_df=potholes_df[(potholes_df.OPEN_DT>datetime.date(2015,2,1)) & (potholes_df.OPEN_DT<datetime.date(2015,3,1)) & (potholes_df.time_repair>0.5)]
claims_feb_2014_df=potholes_df[(potholes_df.OPEN_DT>datetime.date(2014,2,1)) & (potholes_df.OPEN_DT<datetime.date(2014,3,1)) & (potholes_df.time_repair>0.5)]
# Set main plot parameters
fig, ax = plt.subplots(figsize=(16,7))
sns.set(color_codes=True)
sns.set(palette="muted")
sns.distplot(claims_feb_2015_df.time_repair,
kde=False,hist_kws=dict(edgecolor="black"),color='dodgerblue',
bins = np.linspace(0, 80, 17),label="February 2015")
sns.distplot(claims_feb_2014_df.time_repair,
kde=False,hist_kws=dict(edgecolor="black"),color='orange',
bins = np.linspace(0, 80, 17),label="February 2014")
ax.legend()
# Set plot labels
ax.set_title("Repair time for February 2014 and 2015 (Repair faster than 12h excluded)")
ax.set(xlabel="Time for repair in days",ylabel='Claim count')
plt.savefig('./Figures/92_Repair_Count.jpg',bbox_inches='tight');
As shown above, the distribution of repair times for February 2015 spreads from 0 to 80 days with a roughly linear decrease, whereas the February 2014 data is mostly contained in the 0-to-5-day bin. Two factors can explain the difference:
Just like we did for the monthly number of claims, we will evaluate the impact of the weather on the repair delay.
# Prepare dataframe
yearly_repair_time_df = potholes_df[['OPEN_DT','time_repair','CASE_ENQUIRY_ID']].copy()
yearly_repair_time_df = yearly_repair_time_df[yearly_repair_time_df.time_repair>=0.5]
yearly_repair_time_df.OPEN_DT = yearly_repair_time_df.OPEN_DT.apply(lambda x: x.replace(day=1)).dt.date
# Filter out February 2015
yearly_repair_time_df = yearly_repair_time_df[yearly_repair_time_df.OPEN_DT!=datetime.date(2015,2,1)]
# Add season for visual inspection
season_dict = {
1: 'Winter',
2: 'Spring',
3: 'Spring',
4: 'Spring',
5: 'Summer',
6: 'Summer',
7: 'Summer',
8: 'Fall',
9: 'Fall',
10: 'Fall',
11: 'Winter',
12: 'Winter',
}
# We create the season feature
yearly_repair_time_df['Season'] = yearly_repair_time_df.OPEN_DT.map(lambda x: season_dict[x.month])
yearly_repair_time_df = yearly_repair_time_df.groupby('OPEN_DT').median()
# Weather data
#weather_repair_df = weather_df.loc[:,['DSNW','DT32','TAVG',"DP10"]]
weather_repair_df = weather_df.copy()
# Merge pothole and weather data
yearly_repair_time_df = yearly_repair_time_df.merge(right=weather_repair_df,how='left',left_index=True,right_index=True)
yearly_repair_time_df.tail()
# Compute correlation matrix and trim the unnecessary values
correlation = yearly_repair_time_df.corr()
correlation=correlation.loc[correlation.columns.str.contains('time_repair'),~correlation.index.str.contains('time_repair')]
correlation
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 6))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
ax = plt.axes()
sns.heatmap(correlation, cmap=cmap,square=True, linewidths=.5,annot=True,annot_kws={"size":12})
ax.set_title("Repair time - Correlation study")
plt.savefig('./Figures/90_Plt_Heat_Weather_Time.jpg',bbox_inches='tight');
As shown above, there is no clear correlation between the weather data and the repair time. This confirms that, except for a few outliers, the city responds quickly and consistently when fixing potholes reported through 3-1-1 calls.
sns.set_context("notebook", font_scale=1.3, rc={"lines.linewidth": 2.2})
sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize=(16,6))
sns.set_style("whitegrid")
sns.regplot(x='DSNW',y="time_repair",data=yearly_repair_time_df,color="royalblue")
ax.set_title("Effect of the number of snow days (outlier excluded)")
ax.set(xlabel="Number of snow day per month",ylabel='Median repair time')
plt.savefig('./Figures/7_Reg_DSNW_TIME.jpg',bbox_inches='tight');
slope,intercept = np.polyfit(yearly_repair_time_df.DSNW.values,
yearly_repair_time_df.time_repair.values,
deg=1)
print("Slope:",slope)
print("Intercept",intercept)
There is no clear correlation between the number of snow days and the time to repair a pothole.
fig, ax = plt.subplots(figsize=(16,6))
sns.regplot(x='DT32',y="time_repair",data=yearly_repair_time_df,color="royalblue")
ax.set_title("Effect of the number of freezing days (outlier excluded)")
ax.set(xlabel="Number of freezing day per month",ylabel='Median repair time')
plt.savefig('./Figures/8_Reg_DT32_TIME.jpg',bbox_inches='tight');
slope,intercept = np.polyfit(yearly_repair_time_df.DT32.values,
yearly_repair_time_df.time_repair.values,
deg=1)
print("Slope:",slope)
print("Intercept",intercept)
There is no clear correlation between the number of freezing days and the time to repair a pothole.
fig, ax = plt.subplots(figsize=(16,6))
sns.regplot(x='TAVG',y="time_repair",data=yearly_repair_time_df,color="orangered")
ax.set_title("Effect of the daily average temperature (outlier excluded)")
ax.set(xlabel="Monthly average temperature",ylabel='Median repair time')
plt.savefig('./Figures/9_Reg_TAVG_TIME.jpg',bbox_inches='tight');
slope,intercept = np.polyfit(yearly_repair_time_df.TAVG.values,
yearly_repair_time_df.time_repair.values,
deg=1)
print("Slope:",slope)
print("Intercept",intercept)
There is no clear correlation between the average temperature and the time to repair a pothole.
fig, ax = plt.subplots(figsize=(16,6))
sns.regplot(x='DP10',y="time_repair",data=yearly_repair_time_df,color="navy")
ax.set_title("Effect of the precipitation (outlier excluded)")
ax.set(xlabel="Number of days with more than one inch of rain",ylabel='Median repair time')
plt.savefig('./Figures/10_Reg_DP10_TIME.jpg',bbox_inches='tight');
slope,intercept = np.polyfit(yearly_repair_time_df.DP10.values,
yearly_repair_time_df.time_repair.values,
deg=1)
print("Slope:",slope)
print("Intercept",intercept)
There is no clear correlation between the number of rainy days and the time to repair a pothole.
# Scatter plot of the median repair time as a function of the snowfall and freezing days
fig, ax = plt.subplots(figsize=(16,6))
plt.scatter(x=yearly_repair_time_df['DT32'],
y=yearly_repair_time_df['DSNW'],
s=yearly_repair_time_df['time_repair']*200,
c = 'dodgerblue',alpha = 0.5)
ax.set_title("Median repair time")
ax.set(xlabel="Number of freezing days per month",ylabel='Number of snow days per month');
plt.legend(handles=[mpatches.Patch(color='dodgerblue', alpha=0.5,label='Median repair time')]);
Again, there is no clear correlation. In conclusion, it seems that the repair time is not impacted by the weather conditions.
In this section, we will investigate if the time to repair a pothole and the number of claims are correlated. We will be working using monthly periods.
# Prepare dataframe
repair_vs_count_df = potholes_df[['OPEN_DT','time_repair','CASE_ENQUIRY_ID']].copy()
repair_vs_count_df = repair_vs_count_df[repair_vs_count_df.time_repair>=0.5]
# Re-adjust the OPEN_DT to the first day of the month
repair_vs_count_df.OPEN_DT = repair_vs_count_df.OPEN_DT.apply(lambda x: x.replace(day=1)).dt.date
# Filter out February 2015
repair_vs_count_df = repair_vs_count_df[repair_vs_count_df.OPEN_DT!=datetime.date(2015,2,1)]
repair_vs_count_df = repair_vs_count_df.groupby('OPEN_DT').agg({'time_repair':['median'], 'CASE_ENQUIRY_ID':['count']})
repair_vs_count_df.head()
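As an alternative to rewriting every date to the first of the month, pd.Grouper can group directly on monthly periods. A minimal sketch with toy data; note that named aggregation yields flat column names instead of the MultiIndex produced by the dict form of .agg used above:

```python
import pandas as pd

# Toy claim log with hypothetical open dates and repair times
df = pd.DataFrame({
    'OPEN_DT': pd.to_datetime(['2015-01-05', '2015-01-20', '2015-03-02']),
    'time_repair': [1.0, 3.0, 2.0],
})
# Group on month-start periods; empty months appear with a count of 0
monthly = df.groupby(pd.Grouper(key='OPEN_DT', freq='MS')).agg(
    median_repair=('time_repair', 'median'),
    claim_count=('time_repair', 'count'),
)
print(monthly)
```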
# Regression plot of the median repair time as a function of the monthly number of claims
fig, ax = plt.subplots(figsize=(16,6))
sns.regplot(y=repair_vs_count_df[('time_repair', 'median')],
x=repair_vs_count_df[('CASE_ENQUIRY_ID', 'count')])
ax.set_title("Median repair time vs number of claims")
ax.set(ylabel="Median repair time in days",xlabel='Average number of monthly claims')
plt.savefig('./Figures/11_Reg_TIME_CLAIM.jpg',bbox_inches='tight');
slope,intercept = np.polyfit(repair_vs_count_df[('CASE_ENQUIRY_ID', 'count')].values,
repair_vs_count_df[('time_repair', 'median')].values,
deg=1)
print("Slope:",slope)
print("Intercept",intercept)
repair_vs_count_df.corr()
As shown above, there is only a weak positive correlation (0.21) between the two features.
# Prepare dataframe
intersection_df = potholes_df[['CASE_ENQUIRY_ID','OPEN_DT','is_intersection']].copy()
intersection_df['OPEN_DT'] = intersection_df.OPEN_DT.apply(lambda x: x.year)
intersection_df = intersection_df.groupby(['OPEN_DT','is_intersection']).count()
intersection_df
As shown above, the proportion of potholes located within intersections is extremely high compared to the proportion of road surface that intersections represent. This can be explained by the failure mechanism of the top layer of the road: asphalt performs well in compression but is much weaker in shear, which is the loading that occurs when a car changes direction. Even if we conservatively assume that intersections account for 10% of the roads in the city, they still account for ~33% of the pothole claims.
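The gap between the assumed 10% road share and the ~33% claim share can be checked with a one-proportion test. Below is a sketch using a normal approximation to the binomial; the counts are hypothetical, not taken from the data:

```python
import numpy as np
from scipy.stats import norm

n_claims = 30000          # hypothetical total number of claims
n_intersection = 10000    # hypothetical number of claims at intersections
p0 = 0.10                 # assumed share of road surface in intersections

p_hat = n_intersection / n_claims
# z-statistic for H0: true proportion equals p0 (normal approximation)
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n_claims)
p_value = 2 * (1 - norm.cdf(abs(z)))
print("Observed share:", round(p_hat, 3))
print("z-statistic:", round(z, 1))
print("p-value:", p_value)
```

With counts of this magnitude, the difference is far outside what sampling noise could produce.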
In this section, we will investigate whether or not attaching a photo to a claim speeds up the repair.
We perform a hypothesis testing with the following hypotheses:
Null hypothesis: Ho -> The mean repair time is the same for cases with and without a picture.
Alternative hypothesis: Ha -> The mean repair time for cases with pictures is different from that for cases without pictures.
In this case, we do not make any assumptions on the distribution of the mean repair time.
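The helper draw_bs_reps used below is defined in the tools section at the top of the notebook. For readability, a typical implementation (an assumption about that helper, not a copy of it) looks like this:

```python
import numpy as np

def bootstrap_replicate_1d(data, func):
    """Apply func to one bootstrap resample of data."""
    return func(np.random.choice(data, size=len(data)))

def draw_bs_reps(data, func, size=1):
    """Draw `size` bootstrap replicates of func applied to data."""
    return np.array([bootstrap_replicate_1d(data, func) for _ in range(size)])

# Example: bootstrap replicates of the mean of a tiny sample
np.random.seed(42)
reps = draw_bs_reps(np.array([1.0, 2.0, 3.0, 4.0]), np.mean, size=1000)
print("Mean of replicates:", reps.mean())
```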
# Prepare dataframe
repair_vs_count_df = potholes_df[['time_repair','SubmittedPhoto_Bool']].copy()
repair_vs_count_df = repair_vs_count_df[repair_vs_count_df.time_repair>=0.5]
# Compute difference of mean
empirical_diff_means = np.mean( repair_vs_count_df[repair_vs_count_df.SubmittedPhoto_Bool].time_repair)-np.mean(repair_vs_count_df[~repair_vs_count_df.SubmittedPhoto_Bool].time_repair)
print("Mean with picture:",np.mean( repair_vs_count_df[repair_vs_count_df.SubmittedPhoto_Bool].time_repair))
print("Mean without picture:",np.mean( repair_vs_count_df[~repair_vs_count_df.SubmittedPhoto_Bool].time_repair))
print("Empirical mean difference:",empirical_diff_means)
# Compute mean repair time for all cases
mean_repair_time = np.mean(repair_vs_count_df.time_repair.values)
print("Mean repair time in days:",mean_repair_time)
# Generate arrays of repair times
array_repair_pic = repair_vs_count_df[repair_vs_count_df.SubmittedPhoto_Bool].time_repair
array_repair_nopic = repair_vs_count_df[~repair_vs_count_df.SubmittedPhoto_Bool].time_repair
# Generate shifted arrays based on the null hypothesis
shifted_repair_pic = array_repair_pic-np.mean(array_repair_pic)+mean_repair_time
shifted_repair_nopic = array_repair_nopic-np.mean(array_repair_nopic)+mean_repair_time
# Generate 100,000 bootstrap replicates from the shifted arrays
bs_replicates_pic = draw_bs_reps(shifted_repair_pic, np.mean, 100000)
bs_replicates_nopic = draw_bs_reps(shifted_repair_nopic, np.mean, 100000)
# Get replicates of difference of means: bs_replicates
bs_replicates = bs_replicates_pic-bs_replicates_nopic
# Compute and print p-value: p
p = np.sum(bs_replicates>=empirical_diff_means) / float(len(bs_replicates))
print('p-value =', p)
The p-value of the test is large; therefore, we do not reject the null hypothesis and conclude that attaching a picture or not has no measurable effect on the repair time.
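Note that the alternative hypothesis is two-sided ("different"), while the p-value computed above counts only one tail. A two-sided p-value would compare absolute differences; here is a sketch on synthetic replicates (the numbers are made up, not the notebook's):

```python
import numpy as np

# Synthetic bootstrap replicates of the mean difference under H0
np.random.seed(0)
bs_replicates = np.random.normal(loc=0.0, scale=0.1, size=100000)
empirical_diff_means = 0.05  # hypothetical observed difference of means

# Two-sided p-value: fraction of replicates at least as extreme as observed
p_two_sided = np.mean(np.abs(bs_replicates) >= abs(empirical_diff_means))
print('two-sided p-value =', p_two_sided)
```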
In this section, we will investigate if the number of claims is correlated with the number of people living in a neighborhood but also with the neighborhood area.
# Prepare dataframe
pop_claim_df = potholes_df[['CASE_ENQUIRY_ID','time_repair','LOCATION_ZIPCODE']].copy()
pop_claim_df = pop_claim_df[pop_claim_df.time_repair>=0.5]
# The data is grouped
pop_claim_df = pop_claim_df.groupby('LOCATION_ZIPCODE').agg({'time_repair':['median'], 'CASE_ENQUIRY_ID':['count']})
# Merge pothole and city data
pop_claim_df = pop_claim_df.merge(right=boston_zip_df,how='left',left_index=True,right_index=True)
pop_claim_df.tail()
# Compute correlation matrix and trim the unnecessary values
correlation = pop_claim_df.corr()
correlation=correlation.loc[[('time_repair','median'),('CASE_ENQUIRY_ID','count')],
~correlation.columns.isin([('CASE_ENQUIRY_ID','count'),('time_repair','median'),'Latitude','Longitude'])]
correlation
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(10, 4))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the correct aspect ratio
sns.heatmap(correlation, cmap=cmap,square=True, linewidths=.5,annot=True,annot_kws={"size":15})
ax.set_title("Repair time - Correlation study")
plt.savefig('./Figures/12_Pop_Heat.jpg',bbox_inches='tight');
The correlation study shows the following relationships:
fig, ax = plt.subplots(figsize=(16,6))
sns.regplot(x=pop_claim_df['population'],y=pop_claim_df[("CASE_ENQUIRY_ID", "count")],color="darkviolet")
ax.set_title("Neighborhood population vs number of pothole claims")
ax.set(xlabel="Neighborhood population",ylabel='Number of claim')
plt.savefig('./Figures/13_Pop_Count.jpg',bbox_inches='tight');
slope,intercept = np.polyfit(pop_claim_df['population'].values,
pop_claim_df[("CASE_ENQUIRY_ID", "count")].values,
deg=1)
print("Slope:",slope)
print("Intercept",intercept)
The correlation of the above plot is 0.64. Since the population count is positively correlated with the area of the zip code, the number of pothole claims is also correlated with the neighborhood area. The plot below shows the relationship between the zip code area and the number of claims.
fig, ax = plt.subplots(figsize=(16,6))
sns.regplot(x=pop_claim_df['area_acres'],y=pop_claim_df[("CASE_ENQUIRY_ID", "count")],color="tomato")
ax.set_title("Zip code area vs number of pothole claims")
ax.set(xlabel="Zip code area [acres]",ylabel='Number of claim')
plt.savefig('./Figures/14_Area_Count.jpg',bbox_inches='tight');
slope,intercept = np.polyfit(pop_claim_df['area_acres'].values,
pop_claim_df[("CASE_ENQUIRY_ID", "count")].values,
deg=1)
print("Slope:",slope)
print("Intercept",intercept)
In the above plot, the correlation factor is 0.42.
fig, ax = plt.subplots(figsize=(16,6))
sns.regplot(x=pop_claim_df['population'],y=pop_claim_df[("time_repair", "median")],color="darkviolet")
ax.set_title("Neighborhood population vs median repair time")
ax.set(xlabel="Neighborhood population",ylabel='Median repair time [days]')
plt.savefig('./Figures/15_Time_Pop.jpg',bbox_inches='tight');
slope,intercept = np.polyfit(pop_claim_df['population'].values,
pop_claim_df[("time_repair", "median")].values,
deg=1)
print("Slope:",slope)
print("Intercept",intercept)
Interestingly, while the data in the plot above is widely spread around the regression line, there is a weak negative correlation (-0.31) between the population size and the median repair time.
In this section, we will try to understand whether more pothole claims are reported in certain neighborhoods. We will normalize the data by expressing the number of claims per 100 inhabitants.
# Create a data frame that contains the number of claim per year per zip code
claim_per_zip_df = potholes_df.copy()
claim_per_zip_df.LOCATION_ZIPCODE=claim_per_zip_df.LOCATION_ZIPCODE.astype(int)
claim_per_zip_df = claim_per_zip_df[['CASE_ENQUIRY_ID','LOCATION_ZIPCODE']].groupby('LOCATION_ZIPCODE').count()
# Merge the claim_per_zip_df with the boston_zip_df in order to normalize the number of cases per 100 people
claim_per_zip_df = claim_per_zip_df.merge(boston_zip_df,
left_index=True,
right_index=True,
how='left')
claim_per_zip_df.reset_index(drop=False,inplace=True)
# Create the new feature
claim_per_zip_df['pothole_density'] = claim_per_zip_df['CASE_ENQUIRY_ID']/claim_per_zip_df['population']*100
claim_per_zip_df.LOCATION_ZIPCODE = "0"+claim_per_zip_df.LOCATION_ZIPCODE.astype(str)
claim_per_zip_df.head()
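Prepending "0" works here because every Boston zip code loses exactly one leading zero when stored as an integer; pandas' str.zfill is a more general way to restore the padding:

```python
import pandas as pd

# Zip codes that lost their leading zero when stored as integers
zips = pd.Series([2110, 2210, 2113])
# zfill pads each string with leading zeros up to a width of 5
padded = zips.astype(str).str.zfill(5)
print(padded.tolist())
```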
In order to adapt the density plot to the range of values, we first identify the quantiles of the distribution.
claim_per_zip_df.pothole_density.describe()
Upon review of the values presented above, the data appears skewed: several records (potential outliers) have a high ratio of pothole claims to population.
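Rather than hardcoding the threshold_scale used for the choropleth, the bin edges could be derived from quantiles so the scale adapts to the skewed distribution. A sketch with hypothetical densities:

```python
import pandas as pd

# Hypothetical pothole densities (claims per 100 people) per zip code
density = pd.Series([2.5, 3.1, 4.0, 5.2, 6.8, 9.9, 21.0, 48.0])
# Quantile-based bin edges adapt the choropleth scale to a skewed distribution
threshold_scale = density.quantile([0.2, 0.4, 0.6, 0.8, 1.0]).tolist()
print(threshold_scale)
```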
sns.set_context("notebook", font_scale=1.3, rc={"lines.linewidth": 2.2})
fig, ax = plt.subplots(figsize=(18,6))
sns.boxplot(x='pothole_density',
data=claim_per_zip_df,
palette="Set2")
sns.swarmplot(x='pothole_density',
data=claim_per_zip_df,
color=".25")
ax.set(xlabel="potholes per 100 people")
plt.title('Number of pothole claims per 100 people per neighborhood', fontsize=14);
Per the boxplot shown above, three zip codes have a much higher ratio than the others. In order to draw a conclusion, we will now plot the results on a map.
# Create map object, set location and zoom
map_density = folium.Map(location=[42.357554, -71.063913],zoom_start=11)
threshold_scale = [3,5,7,10,20,50]
map_density.choropleth(geo_path='./Original Data/Map per zip/Zip_Codes.json',
data = claim_per_zip_df,
columns=['LOCATION_ZIPCODE','pothole_density'],
key_on='feature.properties.ZIP5',
fill_color='YlOrRd',
threshold_scale=threshold_scale,
legend_name = "Pothole claims per 100 inhabitants")
map_density
# List the zip codes with the highest claim-to-population ratios
claim_per_zip_df.sort_values(by='pothole_density',ascending=False).head(5)
Now that we have identified the outliers, the goal is to understand why they have such a high ratio. First, we will investigate their population density: if a neighborhood does not have many people living in it but many working in it, its roads are subjected to high traffic while its population count remains low. The first feature the outliers have in common is their location; all three are located in the center of the financial district. These neighborhoods are known for their old, narrow streets.
boston_zip_df[boston_zip_df.index.isin(claim_per_zip_df.sort_values(by='pothole_density',ascending=False).head(3).LOCATION_ZIPCODE.values)]
boston_zip_df.population_density.sort_values(ascending=False)
Based on the population density ranking, zip codes 02210 and 02110 appear in the bottom half of the table in terms of population density. In conclusion, while fewer people live in these areas, these two neighborhoods sit at the junction between South Boston and the cities of Cambridge and Somerville (both located north of the Charles River).
We saw that discrepancies appear when comparing the number of claims per neighborhoods. We will now refine the study of the number of claims distribution by looking at a finer mesh. The size of the mesh is based on the range of latitude and longitude of the pothole claims.
fig, ax = plt.subplots(figsize=(16*1.1469,16))
plt.hist2d(x=potholes_df['LONGITUDE'],y=potholes_df['LATITUDE'],bins=(int(50*1.1469),50),cmap='Reds')
plt.colorbar()
ax.set_title("Distribution of the number of claims (2011 to 2017)")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
plt.savefig('./Figures/15_Fine_Mesh.jpg',bbox_inches='tight');
plt.show();
As shown above, there are several locations with a high number of claims. They correspond to the West Boston area and the city's historical center.
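Such hot spots can also be located programmatically with np.histogram2d, the function behind plt.hist2d. A sketch on synthetic coordinates (a dense cluster plus uniform noise, not the real claim data):

```python
import numpy as np

# Synthetic claims: a dense cluster near one point plus uniform background
rng = np.random.default_rng(0)
lon = np.concatenate([rng.normal(-71.06, 0.01, 500), rng.uniform(-71.2, -71.0, 100)])
lat = np.concatenate([rng.normal(42.36, 0.01, 500), rng.uniform(42.25, 42.40, 100)])

# Bin the points and find the densest cell
counts, lon_edges, lat_edges = np.histogram2d(lon, lat, bins=(57, 50))
i, j = np.unravel_index(np.argmax(counts), counts.shape)
center_lon = (lon_edges[i] + lon_edges[i + 1]) / 2
center_lat = (lat_edges[j] + lat_edges[j + 1]) / 2
print("Densest cell centre:", center_lon, center_lat)
```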
In this section, we will try to understand whether the repair delay varies between areas.
# Create a data frame that contains the number of claim per year per zip code
repair_time_zip_df = potholes_df.copy()
repair_time_zip_df.LOCATION_ZIPCODE=repair_time_zip_df.LOCATION_ZIPCODE.astype(int)
# As previously explained, we will focus on repairs that took more than 12 hours
repair_time_zip_df = repair_time_zip_df[repair_time_zip_df.time_repair>=0.5]
# Normalize the zip code feature repair_time_zip_df
repair_time_zip_df.LOCATION_ZIPCODE = "0"+repair_time_zip_df.LOCATION_ZIPCODE.astype(str)
repair_time_zip_df = repair_time_zip_df[['time_repair','LOCATION_ZIPCODE']].groupby('LOCATION_ZIPCODE').median()
repair_time_zip_df.reset_index(drop=False,inplace=True)
repair_time_zip_df.head()
repair_time_zip_df.time_repair.describe()
This time, the distribution is much less spread out.
sns.set_context("notebook", font_scale=1.3, rc={"lines.linewidth": 2.2})
fig, ax = plt.subplots(figsize=(18,6))
sns.boxplot(x='time_repair',
data=repair_time_zip_df,
palette="Set2")
sns.swarmplot(x='time_repair',
data=repair_time_zip_df,
color=".25")
ax.set(xlabel="Median repair time [days]")
plt.title('Median repair time per area', fontsize=14);
# Create map object, set location and zoom
map_repair = folium.Map(location=[42.357554, -71.063913],zoom_start=11)
threshold_scale = [1,1.22,1.4,1.6,1.8,2]
map_repair.choropleth(geo_path='./Original Data/Map per zip/Zip_Codes.json',
data = repair_time_zip_df,
columns=['LOCATION_ZIPCODE','time_repair'],
key_on='feature.properties.ZIP5',
fill_color='YlOrRd',
threshold_scale=threshold_scale,
legend_name = "Repair time in days.")
map_repair
repair_time_zip_df.sort_values('time_repair')
We already stated that the data range is small, but the map above shows a clear split between the northernmost areas and the southernmost ones. We will try to understand this trend by plotting the population density on a map.
# Create map object, set location and zoom
map_pop_density = folium.Map(location=[42.357554, -71.063913],zoom_start=11)
claim_per_zip_df.population_density=claim_per_zip_df.population_density.divide(1000)
threshold_scale = [claim_per_zip_df.population_density.quantile(0.20),
claim_per_zip_df.population_density.quantile(0.40),
claim_per_zip_df.population_density.quantile(0.60),
claim_per_zip_df.population_density.quantile(0.80),
claim_per_zip_df.population_density.quantile(1.00)]
map_pop_density.choropleth(geo_path='./Original Data/Map per zip/Zip_Codes.json',
data = claim_per_zip_df,
columns=['LOCATION_ZIPCODE','population_density'],
key_on='feature.properties.ZIP5',
fill_color='BuPu',
legend_name = "Population density (in thousands)",
threshold_scale=threshold_scale)
map_pop_density
The map shown above depicts the population density per neighborhood. As we can see, there is a higher population density for the neighborhoods located near the heart of Boston. However, the study of the population density is not enough to explain the discrepancy between the repair delay. The delay to fix the potholes is probably more related to the organization of the Department of Transportation and how its teams are assigned to districts.
We saw that discrepancies appear when comparing the median repair time per neighborhood. We will now study a few selected neighborhoods in more detail.
Performing a detailed case study of all the neighborhoods is not practical and would not provide much additional insight. Instead, we will choose three neighborhoods based on the study done so far. The selected neighborhoods are the following:
selected_zip = ['02114','02210','02113']
# Filter the claims located in the selected zip codes
selected_zip_df = potholes_df[potholes_df.LOCATION_ZIPCODE.isin(selected_zip)]
# Build [latitude, longitude] pairs for these claims
selected_zip_coords = [[x,y] for x,y in
zip(selected_zip_df.LATITUDE.values,selected_zip_df.LONGITUDE.values)]
# Create map object
selected_zip_map = folium.Map(location=[42.357554, -71.063913],zoom_start=12)
# Create markers and plot map
selected_zip_map.add_child(folium.GeoJson(data=open('./Original Data/Map per zip/map.geojson'),
name='Zip codes',
style_function=lambda x: {'fillColor':'red' if x['properties']['ZIP5'] in selected_zip
else 'white','fillOpacity' : 0.5,'weight' : 1,'color':'black'}))
selected_zip_map
We extract the data for these zip codes and look at the variation of the number of claims and repair time during the years.
# Filter the data to only cover the selected zip codes
filtered_potholes_df = potholes_df[potholes_df.LOCATION_ZIPCODE.isin(selected_zip)][['OPEN_DT','CASE_ENQUIRY_ID','LOCATION_ZIPCODE']]
# Set all the dates to the first of the month
filtered_potholes_df.OPEN_DT = filtered_potholes_df.OPEN_DT.apply(lambda x: x.replace(day=1)).dt.date
# Group by date and zip code
filtered_potholes_df = filtered_potholes_df.groupby(['OPEN_DT','LOCATION_ZIPCODE']).count()
filtered_potholes_df.reset_index(drop=False,inplace=True)
# Merge pothole and city data
filtered_potholes_df = filtered_potholes_df.merge(right=boston_zip_df,how='left',left_on="LOCATION_ZIPCODE",right_index=True)
filtered_potholes_df.tail()
# Create claims per 100 people
filtered_potholes_df['pothole_density'] = filtered_potholes_df['CASE_ENQUIRY_ID']/filtered_potholes_df['population']*100
filtered_potholes_df.columns
# Set main plot parameters
sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1.3, rc={"lines.linewidth": 2.2})
# Create x labels using list comprehension
x_tick = [x.year if x.day==1 and x.month==1 else '' for x in filtered_potholes_df.OPEN_DT.unique()]
# Plot
fig, ax = plt.subplots(figsize=(16,8))
ax = sns.pointplot(x=filtered_potholes_df['OPEN_DT'],
y=filtered_potholes_df['pothole_density'],hue=filtered_potholes_df['LOCATION_ZIPCODE'].astype(int),
palette="Set2")
# Set plot labels
ax.set_title("Zip code, Number of claims per zipcode per 100 people")
ax.set(xlabel="",ylabel='Claim count')
ax.set_xticklabels(x_tick)
plt.show()
Interestingly, the zip code 02210 (with the largest claims-to-population ratio) has many months without any claims.
# Filter the data to only cover the selected zip codes
filtered_repair_df = potholes_df[potholes_df.LOCATION_ZIPCODE.isin(selected_zip)][['OPEN_DT','time_repair','LOCATION_ZIPCODE']]
filtered_repair_df = filtered_repair_df[filtered_repair_df.time_repair>=0.5]
# Set all the dates to the first of the month
filtered_repair_df.OPEN_DT = filtered_repair_df.OPEN_DT.apply(lambda x: x.replace(day=1)).dt.date
# Group by date and zip code and compute the median repair time
filtered_repair_df = filtered_repair_df.groupby(['OPEN_DT','LOCATION_ZIPCODE']).median()
filtered_repair_df.reset_index(drop=False,inplace=True)
# Set main plot parameters
sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1.3, rc={"lines.linewidth": 2.2})
# Create x labels using list comprehension
x_tick = [x.year if x.day==1 and x.month==1 else '' for x in filtered_repair_df.OPEN_DT.unique()]
# Plot
fig, ax = plt.subplots(figsize=(16,8))
ax = sns.pointplot(x=filtered_repair_df['OPEN_DT'],
y=filtered_repair_df['time_repair'],hue=filtered_repair_df['LOCATION_ZIPCODE'].astype(int),
palette="Set2")
# Set plot labels
ax.set_title("Median repair time in days for the selected zip codes")
ax.set(xlabel="",ylabel='Median repair time in days')
ax.set_xticklabels(x_tick)
plt.show()
The median repair time for zip code 02114 is extremely volatile. Moreover, it has been increasing on average over the last couple of years. A change in the department or its funding could explain this trend, as the number of potholes has not significantly increased over the same period.
The original questions that were asked at the beginning of this study were the following:
After a detailed review and evaluation of the three sets of data, the following answers can be provided:
Final words:
The purpose of this report was to present the different steps that lead to the understanding of a problem. Obviously, more questions can be asked about this complex topic; we leave room for further exploration in the full project.
As part of the final capstone project report, the following will also be included:
Documentation
import sys
print(sys.version)
print("Pandas:",pd.__version__)
print("Numpy:",np.__version__)
print("Matplotlib:",matplotlib.__version__)
print("Seaborn:",sns.__version__)
print("Scipy:",scipy.__version__)
print("Folium:",folium.__version__)